ABSTRACT

Computerized adaptive testing (CAT) is a procedure for administering tests which are
individually tailored for each examinee. Although the majority of CATs are based on
dichotomous item response theory (IRT) models, some researchers have explored the use of
polytomous IRT models, such as the graded response model and partial credit (PC) model,
in CAT. This study investigated the robustness of a PC model-based CAT's ability
estimation to items which did not fit the PC model. Results showed that for the PC CAT,
reasonably accurate ability estimation ($r_{\hat{\theta}\theta_T} \ge 0.921$) may be
obtained despite adaptive tests which, on average, contained up to 45% misfitting items.
Furthermore, the inclusion of misfitting items did not appear to increase the PC CAT test
lengths. The benefits of polytomous model-based CATs were presented.



One important and very promising application of item response theory (IRT) is
computerized adaptive testing (CAT). Unlike the conventional paper-and-pencil test in
which an examinee is administered all test items, CAT is a procedure for administering
tests which are individually tailored for each examinee. The advantages of IRT-based CAT
over paper-and-pencil testing have been well documented (e.g., Weiss, 1982). Although
not necessary (cf. De Ayala, Dodd, & Koch, 1990), a CAT system typically uses an IRT
model in combination with test item characteristics to estimate the examinee's ability.
Typically, either the three-parameter logistic or Rasch models (e.g., McBride & Martin,
1983; Kingsbury & Houser, 1988) have been used in CAT. Despite research which has
demonstrated the existence of partial knowledge of the correct answer (e.g., Levine &
Drasgow, 1983; Thissen, 1976), dichotomous models and dichotomous model-based CATs
operate as if an examinee either knows the correct answer or randomly selects an
incorrect alternative.

Some research has explored the benefits and operating characteristics of CATs based
on polytomous IRT models (e.g., De Ayala, 1989; Dodd, Koch, & De Ayala, 1989; Koch &
Dodd, 1989; Sympson, 1986). In general, these studies have shown that item pools smaller
than those used with dichotomous model-based CATs have led to satisfactory estimation,
that the use of the ability's standard error of estimation for terminating the adaptive test
is preferred to the minimum item information termination criterion, and that the use of a
variable stepsize instead of a fixed stepsize tends to minimize nonconvergence of trait
estimation. In addition, it should be noted that polytomous model-based CAT may be
used not only with polytomously scored items, but also with solely dichotomously scored
items, or with a combination of the two (i.e., some items are scored polytomously while
others are scored dichotomously).

Polytomous graded models have been used for the assessment of the clinical
competence of physicians (Julian & Wright, 1988), the construction and analysis of
writing tests (Ackerman, 1986; Pollitt and Hutchinson, 1987), educational diagnosis
(Adams, 1988), and in CAT for the administration of Likert-type attitude questions and
personality inventories (Koch & Dodd, 1985; Dodd, 1985; Koch, 1983). Given that a
number of aptitude test items have traditionally been scored in a graded fashion, it is
reasonable and desirable to expect CAT implementations in these subjects to
incorporate a graded scoring system. For instance, statistics, chemistry, and physics
exams are typically graded by giving partial credit for some incorrect answers.
Therefore, it would appear reasonable to expect that the use of partial credit scoring for
some incorrect answers would enhance the acceptance of CAT in these areas. Three
polytomous graded models whose properties for CAT have been studied are Samejima's
(1969) graded response model, the rating scale model (Andrich, 1978), and Masters'
(1982) partial credit (PC) model (e.g., Dodd, Koch, & De Ayala, 1989; Koch & Dodd, 1989;
Dodd, Koch, & De Ayala, in press).

To obtain the advantages of the PC model (and IRT models in general) there must be
satisfactory model-data fit. To the extent that there is low model-data fit, some or all of
the advantages of the model may be lost. Although the assessment of model-data fit may
be approached via a number of different techniques (cf. Hambleton & Rogers, 1986;
Ludlow, 1986; Kingston & Dorans, 1985; Wright & Masters, 1982; Yen, 1981), one common
approach is to use fit statistics.

The Rasch perspective involves retaining only those items which are found to fit the
model. Strictly speaking, items which do not fit the model are examined to determine the
cause of misfit and may still be retained if it is felt that the misfit is due to a few large
residuals. Calibration programs for the Rasch family of models traditionally output a
number of fit statistics, as well as information from other model-data fit approaches.

Although Koch and Dodd (1989) and Dodd, Koch, and De Ayala (1989) have investigated
various facets of adaptive testing with the PC model (i.e., item pool size, stepsizes,
information functions), one factor which has not been addressed and which is crucial for
any implementation is the robustness of the PC model-based CAT to violations of data fit.
Because the creation of the item pool involves the interaction of the subjective
interpretation of model-data fit as well as logistical and administrative factors, the item
pool will consist of items which will vary in their degree of fit (or misfit). For instance,
items may be included in an item pool for reasons of content validity (although the items
may not fit well). Therefore, this study addressed how robust the PC model-based
CAT's ability estimation was to the use of items which did not fit the model.

MODEL

The PC model is appropriate for items with ordered responses, such as aptitude and
achievement test items whose alternatives are inherently ordered or have been ordered
according to degree of correctness (e.g., through partial credit scoring). In addition,
attitude questionnaires and ratings data may also be fitted by the model.

The PC model provides a direct expression of the probability of an examinee with
ability $\theta$ responding in a particular category. In the PC model the examinee-item
interaction is modeled as:

\[
P_{x_i}(\theta) = \frac{\exp\left[\sum_{j=0}^{x_i} (\theta - b_{ij})\right]}
                       {\sum_{k=0}^{m_i} \exp\left[\sum_{j=0}^{k} (\theta - b_{ij})\right]}
\qquad (1),
\]

where $\theta$ is the latent trait, $b_{ij}$ is the difficulty parameter of the step
associated with category score j of item i, and $m_i$ is the highest category score of
item i, so that $x_i = 0, 1, \ldots, m_i$. A category score reflects the number of
successfully completed steps. A "step" is simply a stage required to complete an item.
For instance, the problem $((6/3) + 2)^2$ is considered to contain three steps because
there are three separate stages which must be completed (in a specific order) to
correctly answer the problem (i.e., step 1: 6/3, step 2: the addition of 2 to the quotient,
and step 3: the squaring of the quantity). For notational convenience
$\sum_{j=0}^{0} (\theta - b_{ij})$ (i.e., the term where j = 0) is defined as being equal
to zero.

Because the PC model is an extension of the Rasch model it assumes that all items are
equally good at discriminating among examinees. In addition, as a member of the Rasch
family, the PC model's item and person parameters may be estimated on the basis of the
existence of sufficient statistics. Specifically, an examinee's test score contains all the
information for estimating his or her ability, and the items' difficulties may be estimated
from a simple count of the number of persons completing each "step" of an item. The PC
model requires that the steps within an item be completed in sequence, although the steps
need not be equally difficult nor be ordered in terms of difficulty. If an item consists of
only two categories, then the PC model reduces to the Rasch model.
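
As an illustration of Equation 1, the following minimal sketch computes the category
probabilities for a single item. The function name and the example step difficulties are
hypothetical and are not taken from the study; the sketch assumes that category scores run
from 0 to the number of steps.

```python
import math

def pc_probabilities(theta, step_difficulties):
    """Category probabilities for one partial credit item.

    step_difficulties[j - 1] is the difficulty of step j (j = 1..m);
    the returned list has m + 1 entries, one per category score 0..m.
    """
    # Cumulative sums of (theta - b_j); the empty sum for score 0 is zero.
    cumulative = [0.0]
    for b in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cumulative)
    numerators = [math.exp(c - top) for c in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical three-step item (four categories); steps need not be ordered.
probs = pc_probabilities(0.0, [-1.0, 0.5, 0.0])

# A one-step (two-category) item reduces to the Rasch model.
rasch = pc_probabilities(0.3, [-0.2])
```

With a single step the second probability equals the familiar Rasch expression
1 / (1 + exp(-(theta - b))), which is the reduction described in the text.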

METHOD 

Programs: A CAT program was written based on the PC model (PC CAT). The program used
maximum likelihood estimation (MLE) of ability, and item selection was on the basis of
information. The adaptive testing simulation was terminated when either of two criteria
was met: a maximum of thirty items was reached or a predetermined standard error
of estimate (SEE) was obtained (SEE termination criteria of 0.20, 0.25, and 0.30 were used).
Previous research with polytomous model-based CATs has shown that SEE results in better
CAT performance than does the minimum item information criterion (e.g., Dodd, Koch, &
De Ayala, 1989). The initial ability estimate for an examinee was the population's mean,
and a variable stepsize was used for ability estimation when MLE was not possible.
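
The adaptive procedure described above can be sketched as follows. This is not the
study's program: the Newton-Raphson update, the step cap, the particular variable-stepsize
rule (moving halfway toward the most extreme step difficulty in the pool), the pool, and
all names are assumptions made for illustration. The item information used for selection
is the variance of the category score, which holds for the PC model.

```python
import math
import random

def pc_probs(theta, steps):
    cum = [0.0]
    for b in steps:
        cum.append(cum[-1] + theta - b)
    top = max(cum)
    num = [math.exp(c - top) for c in cum]
    total = sum(num)
    return [n / total for n in num]

def item_moments(theta, steps):
    """Expected category score and its variance (the item information)."""
    p = pc_probs(theta, steps)
    mean = sum(x * px for x, px in enumerate(p))
    var = sum(x * x * px for x, px in enumerate(p)) - mean ** 2
    return mean, var

def mle_theta(scores, items, theta, iterations=25):
    """Newton-Raphson MLE of ability; each step is capped at +/-1 (assumption)."""
    for _ in range(iterations):
        moments = [item_moments(theta, s) for s in items]
        gradient = sum(x - m for x, (m, _) in zip(scores, moments))
        information = sum(v for _, v in moments)
        theta += max(-1.0, min(1.0, gradient / information))
    return theta

def run_cat(answer, pool, see_target=0.30, max_items=30):
    theta, used, scores = 0.0, [], []        # start at the population mean
    lowest = min(min(s) for s in pool)
    highest = max(max(s) for s in pool)
    while True:
        # Select the unused item with maximum information at the current estimate.
        nxt = max((i for i in range(len(pool)) if i not in used),
                  key=lambda i: item_moments(theta, pool[i])[1])
        used.append(nxt)
        scores.append(answer(pool[nxt]))
        given = [pool[i] for i in used]
        if all(x == 0 for x in scores):                      # MLE does not exist
            theta += (lowest - theta) / 2.0
        elif all(x == len(s) for x, s in zip(scores, given)):
            theta += (highest - theta) / 2.0
        else:
            theta = mle_theta(scores, given, theta)
        see = 1.0 / math.sqrt(sum(item_moments(theta, s)[1] for s in given))
        if see <= see_target or len(used) == max_items:
            return theta, see, len(used)

# Hypothetical 40-item pool of four-step items and one simulee with true ability 0.5.
rng = random.Random(1)
pool = [[rng.uniform(-2.5, 3.0) for _ in range(4)] for _ in range(40)]
true_theta = 0.5

def simulee(steps):
    weights = pc_probs(true_theta, steps)
    return rng.choices(range(len(weights)), weights=weights)[0]

estimate, see, n_items = run_cat(simulee, pool, see_target=0.30)
```

The two stopping rules appear together in the final condition of the loop, so a test ends
either because the SEE target was reached or because thirty items were administered.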
Data: One thousand simulees were randomly selected from an N(0,1) distribution (the
z-scores were considered to be the simulees' true ability, $\theta_T$). The examinees'
responses to 150 5-alternative items were generated according to a linear factor analytic
model (Wherry, Naylor, Wherry, & Fallis, 1965) in which:

\[
z_{ij} = a_j z_i + \sqrt{1 - h_j^2}\; e_{ij} \qquad (2),
\]

where $z_i$ was examinee i's randomly selected z-score (i.e., $\theta_T$), $a_j$ was item
j's factor loading, $h_j^2$ was item j's communality, and $e_{ij}$ was a z-score random
number that was generated specifically for the error component of item j and examinee i.
Subsequent to the calculation of $z_{ij}$, $z_{ij}$ was compared to pre-specified category
boundaries to determine the category response for examinee i to item j. All factor
loadings were uniformly high and ranged from 0.62 to 0.85. The category boundaries used
may be found in Dodd (1985).

The use of a linear factor analytic approach for data generation allowed item
discriminations to vary and the responses to be a nonogival function of ability (i.e., a
violation of a fundamental IRT assumption).
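
The data generation in Equation 2 might be sketched as below. The category boundaries are
hypothetical placeholders (the study's boundaries are given in Dodd, 1985, and are not
reproduced here), the loadings are drawn uniformly between the reported limits as an
assumption, and the communality is taken to be the squared loading, which assumes a single
common factor.

```python
import math
import random

def generate_responses(n_examinees, loadings, boundaries, seed=0):
    """Simulate category responses (0..len(boundaries)) from Equation 2."""
    rng = random.Random(seed)
    abilities = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]  # true z-scores
    data = []
    for z_i in abilities:
        row = []
        for a_j in loadings:
            e_ij = rng.gauss(0.0, 1.0)          # error specific to item j, examinee i
            # Assumption: one common factor, so the communality h_j^2 equals a_j^2.
            z_ij = a_j * z_i + math.sqrt(1.0 - a_j ** 2) * e_ij
            # Category response = number of boundaries that z_ij exceeds.
            row.append(sum(z_ij > cut for cut in boundaries))
        data.append(row)
    return abilities, data

rng = random.Random(42)
loadings = [rng.uniform(0.62, 0.85) for _ in range(150)]   # range reported in the text
cuts = [-1.5, -0.5, 0.5, 1.5]                              # hypothetical boundaries
abilities, data = generate_responses(1000, loadings, cuts)
```

Four boundaries yield five ordered categories, matching the 5-alternative items, and the
result is a 1000 x 150 response matrix.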

Calibration: MSTEPS (Wright, Congdon, & Schultz, 1989) was used to obtain item parameter
estimates and fit statistics for the PC model.

Fit Analysis: For the purpose of this study the weighted total fit statistic was chosen for
identifying item misfit for the PC model; the weight is the information function and is
used to reduce sensitivity to outliers (Smith, 1988).
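
A weighted fit statistic of this kind can be sketched as follows, assuming the
information-weighted mean square and its cube-root standardization as commonly described
for Rasch-family models (e.g., Wright & Masters, 1982). Whether MSTEPS computes the
statistic in exactly this way is an assumption, and the function names and example data
are hypothetical.

```python
import math

def pc_moments(theta, steps):
    """Expected score E, variance W, and fourth central moment C for one item."""
    cum = [0.0]
    for b in steps:
        cum.append(cum[-1] + theta - b)
    top = max(cum)
    num = [math.exp(c - top) for c in cum]
    total = sum(num)
    probs = [n / total for n in num]
    e = sum(k * p for k, p in enumerate(probs))
    w = sum((k - e) ** 2 * p for k, p in enumerate(probs))
    c = sum((k - e) ** 4 * p for k, p in enumerate(probs))
    return e, w, c

def weighted_fit(responses, thetas, steps):
    """Weighted mean square and standardized fit t for one item."""
    sum_sq = sum_w = sum_q = 0.0
    for x, theta in zip(responses, thetas):
        e, w, c = pc_moments(theta, steps)
        sum_sq += (x - e) ** 2        # squared residual
        sum_w += w                    # weight: the information (variance)
        sum_q += c - w ** 2           # variance of the squared residual
    mean_square = sum_sq / sum_w
    q = math.sqrt(sum_q) / sum_w
    # Cube-root transformation to an approximately unit-normal statistic.
    t = (mean_square ** (1.0 / 3.0) - 1.0) * (3.0 / q) + q / 3.0
    return mean_square, t

# Hypothetical one-step item answered exactly as the model's ordering predicts.
mean_square, t = weighted_fit([0, 1, 0, 1], [-1.0, 1.0, -1.0, 1.0], [0.0])
```

In this small example the responses are more orderly than the model expects, so the mean
square falls below 1 and the standardized value is negative; an item would be flagged
when the absolute value of t exceeds the chosen critical value.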

The original 1000 x 150 data matrix was calibrated and fit statistics were obtained.
After the elimination of items deemed to show "significant" misfit, the data set was
recalibrated without the misfitting items. Fit was then reexamined and items found to fit
were retained; their item parameter estimates were used for the item pools. Because
model-data fit is a matter of degree, various critical values (CV) were used to determine
whether an item was exhibiting significant misfit. For the PC model the CVs used were
±2.0, ±3.0, ±4.0, and ±5.0 (roughly corresponding to $\alpha$ values of 0.046, 0.003, less
than 0.0001, and less than 0.0001, respectively) and CV = ±$\infty$ (i.e., all items were
considered to fit and were included in the item pool).
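
The correspondence between the critical values and the quoted $\alpha$ levels can be
checked with a short calculation, assuming that the fit statistic is referred to a
standard normal distribution and that the test is two-sided. The function name is
illustrative.

```python
import math

def two_sided_alpha(cv):
    """Two-sided tail probability of a standard normal beyond +/- cv."""
    # erfc(cv / sqrt(2)) equals 2 * (1 - Phi(cv)).
    return math.erfc(cv / math.sqrt(2.0))

alphas = {cv: two_sided_alpha(cv) for cv in (2.0, 3.0, 4.0, 5.0)}
for cv, alpha in alphas.items():
    print(f"CV = +/-{cv:.1f}: alpha = {alpha:.7f}")
```

Under these assumptions the values for ±2.0 and ±3.0 round to 0.046 and 0.003, and those
for ±4.0 and ±5.0 are both below 0.0001.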

Summary: A 1000 examinee by 150 item data matrix was generated and calibrated.
Critical values (5 levels) were used for identifying misfitting items. Subsequent to the
elimination of misfitting items, the data were recalibrated and reexamined for misfit.
When no items were found to misfit, the item parameter estimates were used to create a
CAT item pool; five item pools for the PC CAT were created (one corresponding to each CV
level). The design consisted of the crossing of the SEE factor (3 levels:
0.20, 0.25, 0.30) by the CV factor. For each of the 1000 examinees an adaptive test was
simulated using each item pool for the PC CAT.

Analysis: The CAT simulations were analyzed by comparing each CAT's estimated ability
($\hat{\theta}$) with $\theta_T$ through correlational analysis (Pearson product-moment
correlation coefficients: $r_{\hat{\theta}\theta_T}$), average absolute differences (AAD),
standardized root mean squared differences (SRMSD), and standardized differences between
means (SDM) (Doody-Bogan and Yen, 1983). In these indices $\hat{\theta}_j$ was the ability
estimate for examinee j, $\theta_{Tj}$ was the known true ability for examinee j, N was
the number of examinees, $\bar{\hat{\theta}}$ was the mean of $\hat{\theta}$,
$\bar{\theta}_T$ was the mean of $\theta_T$, $s^2_{\hat{\theta}}$ was the variance of
$\hat{\theta}$, and $s^2_{\theta_T}$ was the variance of $\theta_T$. The differences
between $\hat{\theta}$ and $\theta_T$ as a function of $\theta_T$ were graphically examined
(i.e., difference plots). Further, descriptive statistics were calculated on the number of
items administered and on the item pools, the proportion of misfitting items administered
relative to the use of the most conservative CV (i.e., CV = ±2.0) was obtained, and the
item pools' estimated information functions were inspected.
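
A sketch of how these indices might be computed is given below. The exact formulas are an
assumption: AAD is taken as the mean absolute difference, and SRMSD and SDM are
standardized by the square root of the average of the two variances, which is consistent
with the quantities listed above but is not confirmed by this text. Population (divide by
N) variances and the function name are likewise assumptions.

```python
import math

def evaluate(estimates, truths):
    """Return (r, AAD, SRMSD, SDM) for paired ability estimates and true abilities."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    mean_true = sum(truths) / n
    var_est = sum((e - mean_est) ** 2 for e in estimates) / n
    var_true = sum((t - mean_true) ** 2 for t in truths) / n
    covariance = sum((e - mean_est) * (t - mean_true)
                     for e, t in zip(estimates, truths)) / n
    r = covariance / math.sqrt(var_est * var_true)
    diffs = [e - t for e, t in zip(estimates, truths)]
    aad = sum(abs(d) for d in diffs) / n
    # Assumed standardizer: root of the average of the two variances.
    pooled_sd = math.sqrt((var_est + var_true) / 2.0)
    srmsd = math.sqrt(sum(d * d for d in diffs) / n) / pooled_sd
    sdm = (mean_est - mean_true) / pooled_sd
    return r, aad, srmsd, sdm

# Hypothetical check: every estimate is exactly 0.1 above the true ability.
truths = [-1.0, 0.0, 1.0, 2.0]
estimates = [t + 0.1 for t in truths]
r, aad, srmsd, sdm = evaluate(estimates, truths)
```

In this constant-shift example the correlation is 1, AAD is 0.1, and SRMSD equals SDM,
which illustrates that SDM captures overall bias while AAD and SRMSD capture accuracy.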

RESULTS

Calibration and Fit Analysis

For the PC calibration 33, 51, 63, and 78 items were found to fit the PC model using
the CVs of ±2.0, ±3.0, ±4.0, and ±5.0, respectively. The nomenclature for the corresponding
item pools is: model + the number of items in the pool (e.g., PC 33 is the pool for the PC
model containing 33 items and based on CV = ±2.0).

The PC 33-, 51-, 63-, 78-, and 150-item pools had step difficulty estimates which
ranged from -2.50 to 3.03, -2.38 to 3.14, -2.35 to 3.13, -2.44 to 2.97, and -3.0 to 3.31,
respectively. Figure 1 shows the total item pool information for the PC 33-, 51-, 63-, and
78-item pools.

Insert Figure 1 about here

CAT Simulations

For the PC CAT simulations the correlation coefficients between $\hat{\theta}$ and
$\theta_T$ increased as the SEE termination criterion decreased (see Table 1). All
correlation coefficients were equal to or greater than 0.87 and the corresponding
scatterplots showed strong linear associations. As can be seen, even with the 33-item pool
there was a strong linear association between $\hat{\theta}$ and $\theta_T$. Becoming less
conservative with respect to the magnitude of the CV (up to about ±4.0) produced
$r_{\hat{\theta}\theta_T}$s of more or less comparable magnitudes to those obtained with
CVs of ±2.0 and an increase in the number of examinees whose ability estimates were
considered reasonable.

Insert Table 1 about here

Difference plots (i.e., $\hat{\theta} - \theta_T$ as a function of $\theta_T$) for selected
PC CATs are presented in Figure 2; these plots are typical of all the PC CAT plots. As can
be seen, the PC CATs did not tend to either underestimate or overestimate $\theta_T$ in a
systematic way. In general, as the SEE termination criterion decreased the points tended
to become less variable about the baseline of 0.

Insert Figure 2 about here

AAD and SRMSD provide an assessment of the accuracy of estimation across examinees,
while SDM assesses the overall bias between the $\hat{\theta}$s and $\theta_T$s. The
SRMSD and SDM for the PC CATs are presented in Table 2. As can be seen, compared to the
use of the ±2.0 CV, overall accuracy increased when the CVs of ±3.0 and ±4.0 were used.
Regardless of the item pool used, the minimal bias exhibited by the PC CAT may not be
considered meaningful by some. Although SRMSD and SDM are aggregate indices and,
therefore, compensation may occur, the difference plots and the AAD indices showed that
this was not the case. The AAD indices reflected the SRMSD/SDM pattern, that is, CVs of
±3.0 and ±4.0 resulted in the smallest AAD.

Insert Table 2 about here

Table 3 contains descriptive statistics on the PC adaptive tests. As would be expected,
decreasing the SEE termination criterion produced an increase in average and median test
lengths. Similarly, decreasing the SEE termination criterion resulted in an increase in
the proportion of misfitting items administered. Comparing Tables 1 and 3, one sees that
$r_{\hat{\theta}\theta_T} = 0.963$ and $r_{\hat{\theta}\theta_T} = 0.959$ were obtained
(based on 98.8% and 99.0% of the examinees, respectively) despite the administration of
tests containing, on average, 35.4% (CV = ±3.0) and 45.5% (CV = ±4.0) misfitting items.
Inspection of plots of the proportion of misfitting items administered versus ability
showed no systematic relationship.

Insert Table 3 about here

DISCUSSION

Using a CV = ±2.0 only 22% of the original items were found to fit the PC model. As
stated above, each of the 117 items which were found to have significant fit statistics
would have had to be analyzed separately to determine the cause of the misfit. For
instance, the 1000 examinees could be ordered by their ability and their responses
examined to see if individuals with abilities above and below the item's location were
behaving according to expectations. If the majority of the examinees were behaving
according to how the model would predict they should and the fit statistic's significance
could be attributed to discrepancies in the expectations of a few examinees, then the item
would be retained and the analysis would proceed to the next misfitting item. Of course,
with large numbers of examinees and a large number of misfitting items this procedure
would be arduous at best. However, the results showed that strong linear associations
could be obtained despite the inclusion of items which did not fit the PC model at
CV = ±2.0. In fact, when the entire item pool was used with an SEE termination criterion
of 0.20, a fidelity coefficient of 0.945 with comparatively low AAD/SRMSD and SDM values
was obtained. The tradeoff for being able to include a large number of misfitting items
was a substantial increase in the number of individuals whose $\hat{\theta}$s were not
considered reasonable (i.e., $\hat{\theta} \le -4.0$ or $\hat{\theta} \ge 4.0$).

Given the $r_{\hat{\theta}\theta_T}$, difference plots, SRMSD, SDM, and AAD results for
the PC CAT, it appears that item pools smaller than are suggested for dichotomous
model-based CATs can be used with PC model-based CATs; this result replicates Dodd, Koch,
and De Ayala's (1989) and Koch and Dodd's (1989) findings. It appears that reasonably
accurate ability estimation may be obtained despite adaptive tests which, on average,
contained up to 45% misfitting items (i.e., the use of CV = ±4.0 or less). Furthermore,
the inclusion of misfitting items did not appear to increase the PC CAT test lengths.

Table 1: Pearson product-moment correlation coefficients between $\hat{\theta}$ and
$\theta_T$ for PC CAT.

                           Fit Statistics
SEE          ±2.0     ±3.0     ±4.0     ±5.0     All items
0.30         0.919    0.923    0.921    0.902    0.870
0.25         0.944    0.943    0.943    0.934    0.907
0.20         0.963    0.963    0.959    0.952    0.945
Pool Size    33       51       63       78       150
N^a          958      988      990      857      737

^a Refers to the number of cases whose ability estimates fell within the range considered
reasonable ($-4.0 < \hat{\theta} < 4.0$).

Table 2: SRMSD, SDM, and AAD for PC CAT

Fit Statistics   SEE     SRMSD          SDM            AAD
±2.0             0.30    0.471          -0.199         0.345
                 0.25    [illegible]    [illegible]    [illegible]
                 0.20    0.363          -0.212         0.269
±3.0             0.30    0.405          -0.065         0.315
                 0.25    0.356          -0.071         0.273
                 0.20    0.295          -0.076         0.222
±4.0             0.30    0.406          -0.039         0.316
                 0.25    0.353          -0.035         0.262
                 0.20    0.304          -0.045         0.225
±5.0             0.30    0.50[?]        -0.136         0.343
                 0.25    0.437          -0.156         0.292
                 0.20    0.390          -0.166         0.259
All items        0.30    0.628          -0.114         0.351
                 0.25    0.528          -0.108         0.304
                 0.20    0.396          -0.076         0.232

Table 3: Descriptive Statistics for PC CAT

Fit Statistics   SEE     Mean NIA^a   Median NIA^a   SD NIA^a   Range    Proportion^b
±2.0             0.30     8.36         7             3.33        6-30    -
                 0.25    13.01        11             5.36        9-30    -
                 0.20    21.63        20             5.96       14-30    -
±3.0             0.30     8.11         7             2.85        6-30    0.213
                 0.25    11.79        10             4.22        9-30    0.288
                 0.20    18.70        17             5.27       14-30    0.354
±4.0             0.30     7.89         7             2.65        6-30    0.375
                 0.25    11.25        10             3.53        9-30    0.426
                 0.20    17.88        16             4.77       13-30    0.455
±5.0             0.30     7.69         7             2.19        6-30    0.460
                 0.25    10.98        10             2.94        9-30    0.504
                 0.20    17.24        16             3.87       14-30    0.527
All items        0.30     7.98         7             2.53        6-26    0.637
                 0.25    11.20        10             3.24        9-30    0.655
                 0.20    17.31        16             3.83       13-30    0.671

^a Number of items administered
^b Proportion of misfitting items administered relative to the use of CV = ±2.0

Figure 1. Information function estimates: PC model 33-, 51-, 63-, and 78-item pools

Figure 2a. Difference plots ($\hat{\theta} - \theta_T$) for the PC CAT: 33-item pool, termination SEE = 0.30

Figure 2b. Difference plots ($\hat{\theta} - \theta_T$) for the PC CAT: 33-item pool, termination SEE = 0.20

Figure 2c. Difference plots ($\hat{\theta} - \theta_T$) for the PC CAT: 63-item pool, termination SEE = 0.

Figure 2d. Difference plots ($\hat{\theta} - \theta_T$) for the PC CAT: 150-item pool, termination SEE = 0.20